Abstract

We consider the problem of risk estimation for large-margin multi-class classifiers. We propose a novel
risk bound for the multi-class classification problem. The bound involves the margin distribution
of the classifier and the Rademacher complexity of the hypothesis class. We prove that our bound is
tight in the number of classes. Finally, we compare our bound with related ones and provide a
simplified version of the bound for multi-class classification with kernel-based hypotheses.
1 Introduction


The principal goal of statistical learning theory is to provide a framework for studying problems
of a statistical nature and to characterize the performance of learning algorithms in order to facilitate the
design of better learning algorithms.

The statistical learning theory of supervised binary classification is by now well developed,
while its multi-class extension still poses numerous statistical challenges. Multi-class classification problems
arise widely in practice in various domains, ranging from ranking to computer vision.

For binary classification problems a good distribution-free characterization of risk bounds is
given via the VC dimension. Tighter data-dependent bounds are known in terms of Rademacher complexity
or covering numbers. These bounds correctly describe the finite-sample performance of learning algorithms.

Bounding the classification risk for multi-class problems is much less straightforward. Recently, the finite-sample
performance of multi-class learning algorithms was characterized by means of the Natarajan dimension (Daniely
and Shalev-Shwartz 2014, Daniely, Sabato, Ben-David, and Shalev-Shwartz 2011). An interesting VC-dimension
based bound for the risk of large-margin multi-class classifiers is provided in (Guermeur 2007).

These estimates give a quite tight data-independent bound on the risk of multi-class classification
methods. On the other hand, data-dependent characterizations of algorithm quality usually give much
better estimates for practical problems.
Rademacher complexity bounds seem to be one of the tightest ways to estimate the data-dependent finite-sample
performance of learning algorithms (Koltchinskii and Panchenko 2002, Bartlett and Mendelson
2003). There has been a lot of progress in risk estimation for binary classification problems (Bartlett, Bousquet,
and Mendelson 2005, Boucheron, Lugosi, and Massart 2013).


For multi-class learning problems the situation is more delicate. A seminal paper of Koltchinskii &
Panchenko (Koltchinskii and Panchenko 2002) provides a Rademacher complexity based margin risk bound.
The main drawback of this bound is its quadratic dependence on the number of classes, which makes the
bound hardly applicable to real-life huge-scale problems of computer vision or text classification. In spite
of extensive research, there has been only a slight improvement of this bound (Mohri, Rostamizadeh, and
Talwalkar 2012, Cortes, Mohri, and Rostamizadeh 2013).


Contribution. The main contributions of this paper are

a) a new Rademacher complexity based bound for large-margin multi-class classifiers. The bound is
linear in the number of classes, which improves on the quadratic dependence of the formerly best Rademacher
complexity bounds (Koltchinskii and Panchenko 2002, Cortes, Mohri, and Rostamizadeh 2013);

b) a new lower bound on the Rademacher complexity of multi-class margin classification methods.
This means that a Rademacher complexity based bound that is sub-linear in the number of classes is hardly
possible for multi-class margin classifiers in the standard (unconstrained) model. It is still possible
to obtain bounds with a better dependence on the number of classes under other models
or extra assumptions (Allwein, Schapire, and Singer 2001, Dietterich and Bakiri 1995, Zhang 2004).


Paper structure. The paper consists of four parts. In the second part of the paper, we present the
theoretical contribution, namely new Rademacher complexity bounds. It is followed by a discussion of
related works and a comparison of the proposed bound with other multi-class complexity bounds.


2 Multi-class learning guarantees 

We consider a standard multi-class classification framework. Let $\mathcal{X}$ be a set of observations and $\mathcal{Y}$,
$|\mathcal{Y}| < \infty$, be a set of labels. Let $(\mathcal{X} \times \mathcal{Y}, \mathcal{A}, P)$ be a probability space and let $\mathcal{F}$ be a class
of measurable functions from $(\mathcal{X} \times \mathcal{Y}, \mathcal{A})$ into $\mathbb{R}$. Let $\{(x_i, y_i)\}_{i=1}^{\infty}$ be a sequence of i.i.d. random variables
taking values in $(\mathcal{X} \times \mathcal{Y}, \mathcal{A})$ with common distribution $P$. We assume that this sequence is defined on a
probability space $(\Omega, \Sigma, \mathbb{P})$. Let $P_n$ be the empirical measure associated with the sample $S = \{(x_i, y_i)\}_{i=1}^{n}$.

We assume that the labels take values in a finite set $\mathcal{Y}$ with $|\mathcal{Y}| = k$. Let $\mathcal{F}$ be a class of functions
from $\mathcal{X} \times \mathcal{Y}$ into $\mathbb{R}$. A function $f \in \mathcal{F}$ predicts a label $y \in \mathcal{Y}$ for an example $x \in \mathcal{X}$ iff

$$f(x, y) > \max_{y' \ne y} f(x, y'). \qquad (1)$$

The margin of a labeled example $(x, y)$ is defined as

$$m_f(x, y) := f(x, y) - \max_{y' \ne y} f(x, y'), \qquad (2)$$

so $f$ misclassifies the labeled example $(x, y)$ iff $m_f(x, y) < 0$.
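The margin of definition (2) and the empirical margin error $P_n\{m_f < \delta\}$ used later can be sketched in a few lines of Python. The function names and the representation of $f(x, \cdot)$ as a list of per-class scores are illustrative assumptions, not part of the paper.

```python
from typing import Sequence


def margin(scores: Sequence[float], label: int) -> float:
    """Margin m_f(x, y) = f(x, y) - max_{y' != y} f(x, y') for one example.

    scores[j] plays the role of f(x, j); label is the index of the true class.
    """
    if len(scores) < 2:
        raise ValueError("need at least two classes")
    best_other = max(s for j, s in enumerate(scores) if j != label)
    return scores[label] - best_other


def empirical_margin_error(all_scores, labels, delta: float = 0.0) -> float:
    """Fraction of examples whose margin is below delta, i.e. P_n{m_f < delta}."""
    margins = [margin(s, y) for s, y in zip(all_scores, labels)]
    return sum(m < delta for m in margins) / len(margins)
```

A negative margin means that some other label received a higher score than the true one.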

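Let $\mathcal{F}_y$, $y \in \mathcal{Y}$, denote the class of scoring functions $x \mapsto f(x, y)$ associated with the label $y$.
In the more common situation all scoring functions belong to the same class $\mathcal{F}$.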
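We refer to the empirical Rademacher complexity of the class $\mathcal{F}$ as

$$\hat{\mathfrak{R}}_n(\mathcal{F}) = E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i),$$

where $\varepsilon_1, \dots, \varepsilon_n$ are independent $\{\pm 1\}$-valued random variables. Then the Rademacher complexity of $\mathcal{F}$
is $\mathfrak{R}_n(\mathcal{F}) = E\,\hat{\mathfrak{R}}_n(\mathcal{F})$.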
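For a finite class and a small sample the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below assumes the class is given as a table of function values on the sample; the name `empirical_rademacher` is hypothetical.

```python
from itertools import product
from typing import Sequence


def empirical_rademacher(values: Sequence[Sequence[float]]) -> float:
    """Exact empirical Rademacher complexity of a finite function class.

    values[j][i] is the value of the j-th function on the i-th sample point.
    All 2**n sign vectors are enumerated, so this is practical only for small n.
    """
    n = len(values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):
        # Supremum over the class of the signed empirical average.
        total += max(sum(e * v for e, v in zip(signs, row)) for row in values) / n
    return total / 2 ** n
```

For larger samples one would replace the enumeration by an average over randomly drawn sign vectors.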


The following theorem states an upper bound for the classification error of a $k$-class classifier. This
result improves theorem 11 of (Koltchinskii and Panchenko 2002), theorem 1 of (Cortes, Mohri, and
Rostamizadeh 2013) and theorem 8.1 of (Mohri, Rostamizadeh, and Talwalkar 2012) by a factor of $k$.


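Theorem 1. For all $t > 0$,

$$\mathbb{P}\left\{ \exists f \in \mathcal{F} : P\{m_f < 0\} > \inf_{\delta \in (0, 1]} \left[ P_n\{m_f < \delta\} + \frac{4k}{\delta}\,\mathfrak{R}_n(\mathcal{F}) + \left( \frac{\log\log_2(2/\delta)}{n} \right)^{1/2} \right] + \frac{t}{\sqrt{n}} \right\} < 2\exp(-2t^2).$$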
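To see how the terms of Theorem 1 trade off, the right-hand side can be evaluated for a fixed $\delta$. The sketch reads the bound as $P_n\{m_f < \delta\} + (4k/\delta)\,\mathfrak{R}_n(\mathcal{F}) + (\log\log_2(2/\delta)/n)^{1/2} + t/\sqrt{n}$; the placement of the $1/n$ and $t/\sqrt{n}$ terms follows the Koltchinskii and Panchenko style of bound and is an assumption here, as is the function name.

```python
import math


def margin_risk_bound(emp_margin_error: float, k: int, rademacher: float,
                      n: int, delta: float, t: float) -> float:
    """Value of the bound for one fixed delta in (0, 1] (assumed form, see lead-in)."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must lie in (0, 1]")
    # Complexity term, linear in the number of classes k.
    complexity_term = 4.0 * k / delta * rademacher
    # Price for choosing delta after seeing the data.
    loglog_term = math.sqrt(math.log(math.log2(2.0 / delta)) / n)
    return emp_margin_error + complexity_term + loglog_term + t / math.sqrt(n)
```

In practice one would evaluate this over a grid of $\delta$ values and keep the smallest result, mirroring the infimum in the theorem.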


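Later we show that Theorem 1 gives a tight bound on the multi-class complexity.
Let $\mathcal{M}_k(\mathcal{F}_1, \dots, \mathcal{F}_k)$ be a class of functions such that

$$\mathcal{M}_k(\mathcal{F}_1, \dots, \mathcal{F}_k) = \{ m : m(x, y) = f(x, y) - \max_{y' \ne y} f(x, y'),\ f(\cdot, y) \in \mathcal{F}_y \}. \qquad (3)$$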

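Prior to the proof of the theorem one needs to prove the following lemma.

Lemma 1. Let $\mathcal{M}_k(\mathcal{F}_1, \dots, \mathcal{F}_k)$ be a class of margin functions over $\mathcal{F}_1, \dots, \mathcal{F}_k$ defined in (3). Then for
any i.i.d. sample $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ of size $n$ holds

$$\hat{\mathfrak{R}}_n(\mathcal{M}_k(\mathcal{F}_1, \dots, \mathcal{F}_k)) \le \sum_{j=1}^{k} \hat{\mathfrak{R}}_n(\mathcal{F}_j).$$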

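Proof. We provide a proof of the lemma in the case $\mathcal{F} = \mathcal{F}_1 = \dots = \mathcal{F}_k$. It can be easily extended to
the more general case. For a single class $\mathcal{F}$ the class of margin functions $\mathcal{M}_k(\mathcal{F})$ has the form

$$\mathcal{M}_k(\mathcal{F}) = \{ m : m(x, y) = f(x, y) - \max_{y' \ne y} f(x, y'),\ f \in \mathcal{F} \}.$$

Let $m_f(x, y)^{\mathcal{Y}'}$ be a partial margin of the object $(x, y)$ taken with respect to the subset $\mathcal{Y}'$ of the set
of classes, $\mathcal{Y}' \subseteq \mathcal{Y}$:

$$m_f(x, y)^{\mathcal{Y}'} = \begin{cases} f(x, y) - \max\limits_{y' \in \mathcal{Y}',\, y' \ne y} f(x, y'), & \text{if } y \in \mathcal{Y}', \\ -\max\limits_{y' \in \mathcal{Y}'} f(x, y'), & \text{if } y \notin \mathcal{Y}'. \end{cases}$$

Let $\mathcal{M}^{\mathcal{Y}'}(\mathcal{F}) = \{ m : m = m_f(x, y)^{\mathcal{Y}'},\ f \in \mathcal{F} \}$.

The proof is by induction on the size of $\mathcal{Y}'$. Note that $\mathcal{M}^{\mathcal{Y}}(\mathcal{F}) = \mathcal{M}_k(\mathcal{F})$ and

$$m_f(x, y)^{\{1\}} = \begin{cases} f(x, 1), & \text{if } y = 1, \\ -f(x, 1), & \text{if } y \ne 1. \end{cases}$$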
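Denote by $\delta(y, y')$ the indicator of $y = y'$:

$$\delta(y, y') = \begin{cases} 1, & \text{if } y = y', \\ 0, & \text{if } y \ne y'. \end{cases}$$

Then for $\mathcal{Y}' = \{y\}$ holds

$$\hat{\mathfrak{R}}_n(\mathcal{M}^{\{y\}}(\mathcal{F})) = E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i (2\delta(y_i, y) - 1) f(x_i, y) = E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i, y) = \hat{\mathfrak{R}}_n(\mathcal{F}),$$

because the binary sequence $\delta(y_i, y)$ is independent of the class of functions and of the Rademacher
variables $\{\varepsilon_i\}_{i=1}^{n}$. Therefore, the induction base is proved.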
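The induction base rests on the fact that multiplying each Rademacher variable by a fixed sign does not change the expected supremum. A minimal numerical check, with a hypothetical helper that enumerates all sign vectors, could look as follows.

```python
from itertools import product


def rademacher_with_signs(values, fixed_signs):
    """E_eps sup_f (1/n) sum_i eps_i * s_i * f(x_i) for fixed signs s_i in {-1, +1}.

    values[j][i] is the value of the j-th function on the i-th sample point.
    """
    n = len(fixed_signs)
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(
            sum(e * s * v for e, s, v in zip(eps, fixed_signs, row)) for row in values
        ) / n
    return total / 2 ** n


def label_signs(labels, y):
    """Signs 2 * delta(y_i, y) - 1 attached to the sample by the label y."""
    return [2 * int(yi == y) - 1 for yi in labels]
```

The equality holds because the map $\varepsilon_i \mapsto s_i \varepsilon_i$ is a bijection on the set of sign vectors.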

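The induction hypothesis is that for any $\mathcal{Y}' \subset \mathcal{Y}$, $|\mathcal{Y}'| \le t$, the Rademacher complexity of $\mathcal{M}^{\mathcal{Y}'}(\mathcal{F})$ satisfies

$$\hat{\mathfrak{R}}_n(\mathcal{M}^{\mathcal{Y}'}(\mathcal{F})) \le |\mathcal{Y}'|\,\hat{\mathfrak{R}}_n(\mathcal{F}). \qquad (4)$$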


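If $\mathcal{Y}' = \mathcal{Y}$ the statement is proved; otherwise the set $\mathcal{Y} \setminus \mathcal{Y}'$ is not empty. Then for any $y \in \mathcal{Y} \setminus \mathcal{Y}'$ and
i.i.d. sample $S = \{(x_i, y_i)\}_{i=1}^{n}$ holds

$$\hat{\mathfrak{R}}_n(\mathcal{M}^{\mathcal{Y}' \cup \{y\}}(\mathcal{F})) = E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \Big( \sum_{(x_i, y_i) \in S,\ y_i = y} \varepsilon_i \big( f(x_i, y_i) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big) - \sum_{(x_i, y_i) \in S,\ y_i \ne y} \varepsilon_i \max\big\{ f(x_i, y), \max_{y' \in \mathcal{Y}'} f(x_i, y') \big\} \Big).$$

Note that $\max\{f_1, f_2\} = \frac{f_1 + f_2}{2} + \frac{|f_1 - f_2|}{2}$.
Then

$$\hat{\mathfrak{R}}_n(\mathcal{M}^{\mathcal{Y}' \cup \{y\}}(\mathcal{F})) = E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \Big( \sum_{(x_i, y_i) \in S,\ y_i = y} \varepsilon_i \big( f(x_i, y_i) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big) - \sum_{(x_i, y_i) \in S,\ y_i \ne y} \frac{\varepsilon_i}{2} \Big( f(x_i, y) + \max_{y' \in \mathcal{Y}'} f(x_i, y') + \big| f(x_i, y) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big| \Big) \Big)$$

$$\le E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i (2\delta(y_i, y) - 1) f(x_i, y) + E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i (1 - 2\delta(y_i, y)) \max_{y' \in \mathcal{Y}'} f(x_i, y')$$

$$+\ E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i \Big[ \delta(y_i, y) \big( f(x_i, y) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big) + (1 - \delta(y_i, y)) \big| f(x_i, y) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big| \Big].$$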

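Note that $x + y \mapsto x + |y|$ is 1-Lipschitz. Thus by Talagrand's contraction inequality (see
theorem 4.12, p. 112-114 of (Ledoux and Talagrand 1991) and the more appropriate lemma 4.2, p. 78-79 of
(Mohri, Rostamizadeh, and Talwalkar 2012)) holds

$$\hat{\mathfrak{R}}_n(\mathcal{M}^{\mathcal{Y}' \cup \{y\}}(\mathcal{F})) \le E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i (2\delta(y_i, y) - 1) \big( f(x_i, y) - \max_{y' \in \mathcal{Y}'} f(x_i, y') \big)$$

$$+\ E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i (2\delta(y_i, y) - 1) f(x_i, y) + E_\varepsilon \sup_{f \in \mathcal{F}} \frac{1}{2n} \sum_{i=1}^{n} \varepsilon_i (1 - 2\delta(y_i, y)) \max_{y' \in \mathcal{Y}'} f(x_i, y')$$

$$\le \hat{\mathfrak{R}}_n(\mathcal{F}) + \hat{\mathfrak{R}}_n(\mathcal{M}^{\mathcal{Y}'}(\mathcal{F})) \le (|\mathcal{Y}'| + 1)\,\hat{\mathfrak{R}}_n(\mathcal{F}),$$

where the last inequality holds by the inductive hypothesis, ineq. (4). This completes the
inductive proof. □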


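Proof of Theorem 1. Following (Koltchinskii and Panchenko 2002), consider two sequences $\{\delta_j\}_{j \ge 1}$
and $\{\varepsilon_j\}_{j \ge 1}$, $\delta_j \in (0, 1]$.

The standard Rademacher complexity margin bound (theorem 4.4, p. 81-82 of (Mohri, Rostamizadeh,
and Talwalkar 2012)) gives for any fixed $\delta_j$ and $\varepsilon_j$:

$$\mathbb{P}\Big\{ \exists f \in \mathcal{F} : P(m_f(x, y) < 0) - P_n(m_f(x, y) < \delta_j) > \frac{2}{\delta_j}\,\mathfrak{R}_n(\mathcal{M}_k(\mathcal{F})) + \varepsilon_j \Big\} < \exp(-2n\varepsilon_j^2).$$

Then by choosing $\varepsilon_j = \frac{t}{\sqrt{n}} + \sqrt{\frac{\log j}{n}}$ and applying the union bound

$$\mathbb{P}\Big\{ \exists j, \exists f \in \mathcal{F} : P(m_f(x, y) < 0) - P_n(m_f(x, y) < \delta_j) > \frac{2}{\delta_j}\,\mathfrak{R}_n(\mathcal{M}_k(\mathcal{F})) + \varepsilon_j \Big\}$$

$$< \sum_{j \ge 1} \exp(-2n\varepsilon_j^2) \le \exp(-2t^2) \sum_{j \ge 1} \exp(-2\log j) = \frac{\pi^2}{6} \exp(-2t^2) < 2\exp(-2t^2).$$

We choose $\delta_j = 1/2^j$; then $2/\delta_j \le 4/\delta$ for any $\delta \in (\delta_j, \delta_{j-1}]$. By Lemma 1 we have $\mathfrak{R}_n(\mathcal{M}_k(\mathcal{F})) \le k\,\mathfrak{R}_n(\mathcal{F})$, which proves the
theorem. □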
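The union bound step can be written out explicitly. The display below is a worked sketch under the assumption that the deviations are chosen as $\varepsilon_j = t/\sqrt{n} + \sqrt{\log j / n}$, the choice used in Koltchinskii and Panchenko style arguments; the exact choice in the paper is not legible in this copy.

```latex
\begin{align*}
n\varepsilon_j^2 &= \bigl(t + \sqrt{\log j}\bigr)^2 \ge t^2 + \log j, \\
\sum_{j \ge 1} \exp(-2n\varepsilon_j^2)
  &\le \exp(-2t^2) \sum_{j \ge 1} \exp(-2\log j)
   = \exp(-2t^2) \sum_{j \ge 1} \frac{1}{j^2}
   = \frac{\pi^2}{6}\exp(-2t^2) < 2\exp(-2t^2).
\end{align*}
```

With $\delta_j = 2^{-j}$ one also has $\log j = \log\log_2(1/\delta_j) \le \log\log_2(2/\delta)$ for $\delta \in (\delta_j, \delta_{j-1}]$, which is where the $\log\log_2(2/\delta)$ term of Theorem 1 comes from.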

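Below we present Rademacher complexity bounds for multi-class kernel learning in a simplified
form. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite symmetric kernel and $\Phi : \mathcal{X} \to \mathbb{H}$ be a feature mapping
associated to $K$. In the multi-class setting a family of kernel-based hypotheses $\mathcal{H}_{K,p}$ is defined for any
$p \ge 1$ as

$$\mathcal{H}_{K,p} = \{ (x, y) \in \mathcal{X} \times \mathcal{Y} \mapsto w_y \cdot \Phi(x) : W = (w_1, \dots, w_k)^\top,\ \|W\|_{\mathbb{H},p} \le \Lambda \},$$

where $\|W\|_{\mathbb{H},p} = \big( \sum_{l=1}^{k} \|w_l\|_{\mathbb{H}}^{p} \big)^{1/p}$. The labels are assigned according to $\arg\max_{y \in \mathcal{Y}} \langle w_y, \Phi(x) \rangle$.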
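For an explicit finite-dimensional feature map the group norm and the argmax decision rule are easy to compute. The sketch below assumes the usual definition $\|W\|_{\mathbb{H},p} = (\sum_l \|w_l\|^p)^{1/p}$ and represents each $w_y$ and $\Phi(x)$ as a plain list of floats; both function names are hypothetical.

```python
import math


def group_norm(W, p: float) -> float:
    """(sum_l ||w_l||^p)^(1/p) for explicit finite-dimensional weight vectors w_l."""
    return sum(math.sqrt(sum(c * c for c in w)) ** p for w in W) ** (1.0 / p)


def predict(W, phi_x) -> int:
    """Label with the largest score <w_y, Phi(x)>; ties go to the smallest index."""
    scores = [sum(a * b for a, b in zip(w, phi_x)) for w in W]
    return max(range(len(W)), key=scores.__getitem__)
```

A hypothesis belongs to the family with radius $\Lambda$ exactly when `group_norm(W, p) <= Lambda`.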

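The following bound is a corollary of Theorem 1.

Theorem 2. Let $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a positive definite symmetric kernel and let $\Phi : \mathcal{X} \to \mathbb{H}$ be the
associated feature mapping function. Assume that there exists $R > 0$ such that $K(x, x) \le R^2$ for all
$x \in \mathcal{X}$. Then, for any $t > 0$ the following multi-class classification generalization bound holds for all
hypotheses $f \in \mathcal{H}_{K,p}$:

$$\mathbb{P}\left\{ \exists f \in \mathcal{H}_{K,p} : P\{m_f < 0\} > P_n\{m_f < \delta\} + \frac{2k}{\delta} \sqrt{\frac{R^2 \Lambda^2}{n}} + \frac{t}{\sqrt{n}} \right\} < \exp(-2t^2).$$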
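A quick numerical reading of the kernel bound: the sketch evaluates $P_n\{m_f < \delta\} + (2k/\delta)\sqrt{R^2\Lambda^2/n} + t/\sqrt{n}$. This form, in particular the $1/\delta$ factor and the $t/\sqrt{n}$ term, is a reconstruction and should be treated as an assumption; the function and parameter names are illustrative.

```python
import math


def kernel_margin_bound(emp_margin_error: float, k: int, radius: float,
                        lam: float, n: int, delta: float, t: float) -> float:
    """Assumed form: P_n{m_f < delta} + (2k/delta) * sqrt(R^2 Lambda^2 / n) + t / sqrt(n)."""
    # radius is R with K(x, x) <= R^2; lam is the norm budget Lambda.
    complexity_term = 2.0 * k / delta * math.sqrt(radius ** 2 * lam ** 2 / n)
    return emp_margin_error + complexity_term + t / math.sqrt(n)
```

For a Gaussian kernel $K(x, x) = 1$, so one may take `radius = 1.0`; the bound then depends on the data only through the empirical margin error.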
Below we prove that the bound on the Rademacher complexity of the class $\mathcal{M}_k(\mathcal{F}_1, \dots, \mathcal{F}_k)$ is
tight. Let $\mathcal{F}_j^t = \{ f : \mathbb{R} \to [-1; +1] \}$ be a class of functions such that

$$\mathcal{F}_j^t \ni f(x) = \begin{cases} -1, & \text{if } x \notin [j; j + 1], \\ +1 \text{ or } -1, & \text{if } x \in [j; j + 1], \end{cases}$$

and moreover each $f \in \mathcal{F}_j^t$ has in $(j, j + 1)$ no more than $t$ discontinuity points. We refer to $\mathcal{F}_0$ as the
class consisting of the function that takes the value $-1$ over the whole real line.

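Denote

$$\mathcal{F}^t = \{ \max\{f_1, f_2, \dots, f_k\},\ f_j \in \mathcal{F}_j^t \}$$

and consider also the union $\bigcup_{j=1}^{k} \mathcal{F}_j^t$.

Note that all the classes $\{\mathcal{F}_j^t\}_{j \ge 1}$, $\mathcal{F}^t$ and $\bigcup_{j} \mathcal{F}_j^t$ for a fixed $t$ satisfy the conditions of the central limit
theorem.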
theorem. 

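Let $\hat{\mathfrak{R}}_n^{(j)}(\mathcal{F}_j^t)$ be the Rademacher complexity of $\mathcal{F}_j^t$ defined with respect to the interval $(j, j + 1)$ only:

$$\hat{\mathfrak{R}}_n^{(j)}(\mathcal{F}_j^t) = E_\varepsilon \sup_{f \in \mathcal{F}_j^t} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i f(x_i)\,\mathbb{1}_{x_i \in (j, j + 1)}.$$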


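Lemma 2. Let $P_{\mathcal{X}}$ be a uniform distribution over the domain $\mathcal{X} = [1; k + 1]$. Then for any $C > 0$ there
exists $t = t(C, k)$ such that for any sample $S_n = \{x_i\}_{i=1}^{n}$ of size $n$ drawn i.i.d. from $P_{\mathcal{X}}$ and any $j$,
$1 \le j \le k$, holds

$$\mathfrak{R}_n(\mathcal{F}_j^t) \ge C\,\mathfrak{R}_n(\mathcal{F}_0)$$

as soon as $n \ge n_0$, $n_0 = n_0(t)$.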


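Proof. By theorem 5.3.3 of (Talagrand 2014), for any sequence $t_1, \dots, t_m$ in $\mathbb{R}^n$ such that

$$\ell \ne \ell' \Rightarrow \|t_\ell - t_{\ell'}\| \ge a$$

and

$$\forall \ell \le m \Rightarrow \|t_\ell\|_\infty \le b$$

the following lower bound for the Rademacher process holds:

$$E_\varepsilon \sup_{f} \sum_{i=1}^{n} f(x_i)\,\varepsilon_i \ge \frac{1}{L} \min\Big\{ a\sqrt{\log m},\ \frac{a^2}{b} \Big\} \qquad (5)$$

for some absolute constant $L$, where the supremum is taken over the functions whose value vectors on the sample are $t_1, \dots, t_m$.

By the standard chaining argument the Rademacher complexity of the class $\mathcal{F}_0$ satisfies $\mathfrak{R}_n(\mathcal{F}_0) \le C_0/\sqrt{n}$
for some absolute constant $C_0 = C_0(\mathcal{F}) > 0$ independent of $n$.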

Let the objects $x^1, \dots, x^{n_j}$ belonging to $(j, j + 1)$ be ordered in such a way that $(x^i - x^l)(i - l) \ge 0$ for all
$i, l$. Note that for any such sequence there exist functions $\{f_1, \dots, f_{2^{\lfloor n_j/t \rfloor}}\} \subset \mathcal{F}^{t+1}$ such that the function
$f_l$ assigns $+1$ to the objects of the $s$-th block of consecutive objects, $1 \le s \le \lfloor n_j/t \rfloor$, iff the binary representation of $l$ contains 1
in the $s$-th digit from the right. Otherwise it assigns $-1$ to the objects of that block.

Then by the inequality (5) the following lower bound on the Rademacher complexity of the class $\mathcal{F}_j$,
$\mathcal{F}_j = \{f_1, \dots, f_{2^{\lfloor n_j/t \rfloor}}\}$, takes place


1 " 

Ee sup - y^ £if{Xi)l, 




n y rij ’ n j ’ 


/G-?a 
for some absolute constant $L$ stated by the inequality (5).

Recall that the median of the Binomial distribution with parameter $1/k$ is one of the integers $\{\lfloor n/k \rfloor - 1, \lfloor n/k \rfloor, \lfloor n/k \rfloor + 1\}$.
Then the number of objects in $(j, j + 1)$ is $n/k - 2$ or more with probability at
least 1/2.
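The claim about the number of sample points falling into one unit interval can be checked exactly for concrete values of $n$ and $k$. The sketch below computes a Binomial$(n, 1/k)$ tail with the standard library; the function name is hypothetical.

```python
from math import comb


def prob_at_least(n: int, k: int, threshold: int) -> float:
    """P(X >= threshold) for X ~ Binomial(n, 1/k), computed exactly term by term."""
    p = 1.0 / k
    start = max(0, threshold)
    return sum(comb(n, m) * p ** m * (1.0 - p) ** (n - m) for m in range(start, n + 1))
```

For instance, with $n = 100$ and $k = 5$ the threshold $n/k - 2$ equals 18, which lies below the median 20 of the distribution, so the tail probability is at least 1/2.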

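Therefore, if $n \ge 16kt^2$, $t \ge 1$,

$$\mathfrak{R}_n(\mathcal{F}_j) = E\,E_\varepsilon \sup_{f \in \mathcal{F}_j} \frac{1}{n} \sum_{i=1}^{n} f(x_i)\,\varepsilon_i \ge \min\Big\{ \frac{1}{2kL},\ \frac{2t}{L\sqrt{nk}} \Big\} = \frac{2t}{L\sqrt{nk}}.$$

Then it is sufficient to choose $t \ge C_0 C L \sqrt{k}/2$ and $n \ge 16kt^2$ as above to satisfy the conditions of the
lemma. □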

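Theorem 3. Let $P_{\mathcal{X}}$ be a uniform distribution over the domain $\mathcal{X} = [1; k + 1]$ and let $P_{\mathcal{Y}}$ be concentrated on
a single class $k + 1$. Then for any sample $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ of size $n$ drawn i.i.d. from $P_{\mathcal{X}} \times P_{\mathcal{Y}}$ and
any $\epsilon > 0$ for the Rademacher complexity of the margin class $\mathcal{M}_{k+1} = \mathcal{M}_{k+1}(\mathcal{F}_1^t, \dots, \mathcal{F}_k^t, \mathcal{F}_0)$ holds

$$\mathfrak{R}_n(\mathcal{M}_{k+1}) \ge (1 - \epsilon) \sum_{j=1}^{k} \mathfrak{R}_n(\mathcal{F}_j^t)$$

for some large enough $t = t(\epsilon, k)$ independent of $n$ and all $n \ge n_0$, $n_0 = n_0(t)$.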

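Proof. By the symmetry under negation of the classes $\mathcal{F}_j^t$ in $(j, j + 1)$ and the definition of the class $\mathcal{F}^t$ we have

$$\mathfrak{R}_n(\mathcal{F}^t) \ge \Big( 1 - \frac{1}{C} \Big) \sum_{l=1}^{k} \mathfrak{R}_n(\mathcal{F}_l^t) = k \Big( 1 - \frac{1}{C} \Big) \mathfrak{R}_n(\mathcal{F}_{j'}^t), \qquad j' : 1 \le j' \le k,$$

where $C = 1/\epsilon$ is defined in accordance with Lemma 2.

Note that the Rademacher complexity of $\mathcal{M}_{k+1}(\mathcal{F}_1^t, \dots, \mathcal{F}_k^t, \mathcal{F}_0)$ is at least the same as the Rademacher
complexity of $\mathcal{F}^t$ by the construction of $\mathcal{M}_{k+1}$ and $\mathcal{F}_1^t, \dots, \mathcal{F}_k^t$. This proves the theorem. □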

A similar bound holds for the Rademacher complexity of the classes $\mathcal{F}^t$ and $\mathcal{M}_{k+1}(\mathcal{F}^t)$ respectively.
Note that this bound is effectively a lower bound for the estimate of Theorem 1, in the sense that
the bound there cannot be improved based on Rademacher complexity estimates only, if one makes no
assumptions on the behavior of the function class (e.g. a small covering number bound or a small VC
dimension).




3 Related works and discussion 


A number of works are devoted to bounding the risk of multi-class classification methods. One popular
approach to solving a problem with multiple classes is to reduce it to a sequence of binary classification
problems. In terms of the dependence of the risk on the number of classes, a great breakthrough was made with
the design of error-correcting output codes (ECOC) for multi-class classification (Dietterich and Bakiri
1995, Allwein, Schapire, and Singer 2001, Beygelzimer, Langford, and Ravikumar 2009).

In spite of some very promising results concerning ECOC, Rifkin & Klautau argued in (Rifkin and
Klautau 2004) that the classical approaches, such as one-vs-all classification, are at least as preferable as
error-correcting codes from the practical point of view.
Another approach is to define a score function on the point-label pairs and choose the label with the
highest score (the one-vs-all classification method can be considered from this point of view as well). It is
natural to characterize the risk bounds of these methods in terms of the classification margin $\delta$, equal to the
gap between the highest score and the second highest score (see definition (2) for details).


Multi-class SVM extension. Among the methods that share the scoring-based paradigm, one should
mention the Weston & Watkins multi-class extension of SVM (Weston and Watkins 199^). An improved
version of multi-class SVM as well as an improved margin risk bound of the order $O(k^2/(n\delta^2))$ were presented
by Crammer & Singer in (Crammer and Singer 2002b, Crammer and Singer 2002a).


Rademacher complexity bounds. Currently Rademacher complexity as well as combinatorial
dimension estimates seem to be among the most powerful tools for obtaining strong risk bounds for
multi-class classification. An important property of Rademacher complexity based bounds is that the
bounds are applicable in arbitrary Banach spaces and do not depend directly on the dimension of the feature
space.


Koltchinskii & Panchenko introduced a margin-based bound for multi-class classification in terms of
Rademacher complexities (Koltchinskii and Panchenko 2002, Koltchinskii, Panchenko, and Lozano 2001).
The bound was slightly improved (by a constant factor in front of the Rademacher complexity term) in a
series of subsequent works (Mohri, Rostamizadeh, and Talwalkar 2012, Cortes, Mohri, and Rostamizadeh
2013).


The main drawback of these state-of-the-art bounds for multi-class classification is a quadratic
dependence on the number of classes, which makes the bounds unreliable for practical problems with a
considerable number of classes.

The principal contribution of this paper is a new Rademacher complexity based upper bound with
a linear dependence on the number of classes. Moreover, we provide a lower bound on the Rademacher
complexity of margin-based multi-class algorithms. Up to a constant factor it matches the upper
bound. This means that the bound cannot be improved without further assumptions.
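The practical difference between a quadratic and a linear dependence on the number of classes is easy to quantify. The sketch below compares the two orders, $k^2/(\delta\sqrt{n})$ and $k/(\delta\sqrt{n})$, with all constants dropped; the exact form of the quadratic rate is taken from the discussion above and is an assumption, as are the function names.

```python
import math


def quadratic_rate(k: int, n: int, delta: float) -> float:
    """Order k^2 / (delta * sqrt(n)) of the earlier bounds, constants dropped."""
    return k ** 2 / (delta * math.sqrt(n))


def linear_rate(k: int, n: int, delta: float) -> float:
    """Order k / (delta * sqrt(n)) of the linear bound, constants dropped."""
    return k / (delta * math.sqrt(n))
```

The ratio of the two rates is exactly $k$, so for a problem with a thousand classes the linear bound becomes non-vacuous with roughly $k^2 = 10^6$ times fewer samples than the quadratic one.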


Covering number based bounds. Zhang in (Zhang 2004, Zhang 2002) studied covering number
bounds for the risk of multi-class margin classification. Based on a covering number
estimate for the Rademacher complexity of the kernel learning problem, he obtained asymptotically better
rates in the number of classes $k$ (see Table 1) than those proposed in our paper.

Note that Zhang's analysis is based on some extra assumptions (not really too restrictive) about the
underlying hypothesis class and the loss function used. We believe that the results of (Zhang 2004)
are valuable from the theoretical point of view but still quite limited in practice. This is due to the
large overestimate (from a practical perspective) of the Rademacher complexity of the hypothesis class
by an $\ell^\infty$ covering number based bound. It should also be noted that Zhang's bounds are valid only for
learning kernel-based hypotheses and have some extra poly-logarithmic dependence on the number of
labeled examples.


Related results for metric spaces with low doubling dimension were obtained by Kontorovich (Kontorovich
and Weiss 2014), who used the nearest neighbors method to improve the dependence on the number
of classes in favor of a (doubling) dimension dependence. We should note as well that this approach allows one
to speed up multi-class learning algorithms.

We gather the margin-based bounds applicable to learning functions in Hilbert space in Table 1.


Upper bound, O(·)                      Paper

$\frac{k^2}{\delta\sqrt{n}}$           Koltchinskii & Panchenko (Koltchinskii and Panchenko 2002),
                                       Cortes et al. (Cortes, Mohri, and Rostamizadeh 2013),
                                       Mohri et al. (Mohri, Rostamizadeh, and Talwalkar 2012)

$\frac{k}{\delta^2\sqrt{n}}$           Guermeur (Guermeur 2010)

$\frac{1}{\delta}\sqrt{\frac{k}{n}}$   Zhang (Zhang 2004)

$\frac{k^2}{\delta^2 n}$               Crammer & Singer (Crammer and Singer 2002b)

$\frac{k}{\delta\sqrt{n}}$             this paper

Table 1: Dimension-free margin-based bounds for multi-class classification.


Combinatorial dimension bounds. Natarajan dimension was introduced in (Natarajan 1989) in
order to characterize multi-class PAC learnability. It exactly matches the notion of Vapnik-Chervonenkis
dimension in the case of two classes. A number of results concerning risk bounds in terms of Natarajan
dimension were proved in (Daniely, Sabato, Ben-David, and Shalev-Shwartz 2011, Daniely and
Shalev-Shwartz 2014, Ben-David, Cesa-Bianchi, Haussler, and Long 1995, Daniely, Sabato, and Shalev-Shwartz
2012). A closely related but more powerful notion of graph dimension was introduced in (Daniely, Sabato,
Ben-David, and Shalev-Shwartz 2011, Daniely and Shalev-Shwartz 2014). VC-dimension based bounds
for multi-class learning problems were obtained in (Allwein, Schapire, and Singer 2001).

Natarajan and graph dimensions are very useful tools for obtaining multi-class classification risk
bounds. The main drawback of these bounds is that they are data-independent. In this sense, we believe
that the bounds proposed in this paper are much stronger than the Natarajan/graph dimension bounds,
in the same way that Rademacher complexity bounds are stronger than VC dimension bounds for binary
classification.

We also note that VC dimension bounds as well as Natarajan dimension bounds are usually dimension
dependent (Daniely and Shalev-Shwartz 2014), which makes them hardly applicable to practical huge-scale
problems (such as typical computer vision problems).

Guermeur in (Guermeur 2007, Guermeur 2010) gave a bound for a scale-sensitive analog of the Natarajan
dimension. In Hilbert space, for a class of linear functions, this dimension can be bounded in terms of the margin,
which leads to a risk decay rate of the order $O(k/(\delta^2\sqrt{n}))$ (see Table 1).

We gather the bounds above in Table 2. Note that the bound of the order $O(d_{Nat}/n)$ is valid in
the separable case only. A clear comparison between various multi-class classification methods is provided
in (Daniely, Sabato, and Shalev-Shwartz 2012). Lower bounds on the Natarajan dimension and the sample
complexity of multi-class classification methods are provided in (Daniely, Sabato, Ben-David, and
Shalev-Shwartz 2011, Daniely and Shalev-Shwartz 2014). It was shown in (Daniely, Sabato, Ben-David, and
Shalev-Shwartz 2011, Daniely and Shalev-Shwartz 2014) that for multi-class linear classifiers the bounds
on the Natarajan dimension can be poor, since they depend on both the feature space dimension $d$ and the number
of classes $k$. In this work we provide a linear (in the number of classes) lower bound on the Rademacher
complexity of the multi-class margin class of functions (see Theorem 3 for details).

Upper bound, O(·)                                  Paper

$\frac{\log k}{\delta}\sqrt{\frac{d_{VC}}{n}}$     Allwein et al. (Allwein, Schapire, and Singer 2001)

$\frac{\log k}{\delta}\sqrt{\frac{d_{Nat}}{n}}$    Guermeur (Guermeur 2010)

$\frac{d_{Nat}}{n}$                                Daniely et al. (Daniely and Shalev-Shwartz 2014)

Table 2: Combinatorial dimension based upper bounds for multi-class classification.

A preliminary version of the upper bounds (Theorem 1) with a slightly worse dependence on $k$ was
presented by the first author in the context of semi-supervised multi-class classification at the workshop
“Frontiers of High Dimensional Statistics, Optimization, and Econometrics” in February 2015. The risk
bounds stated in this paper were presented in their final form on March 25th at the main seminar of the
Institute for Information Transmission Problems (IITP RAS). In July 2015 the authors were notified by
their colleagues that similar results were proposed independently by Kuznetsov et al. and presented at the
ICML Workshop on Extreme Classification and in (Kuznetsov, Mohri and Syed 2014). Still, we believe
that the bounds presented in this paper are much stronger than the ones presented by Kuznetsov et al.
in the sense that we prove explicit lower bounds as well. This shows that the bound which we proved in
Theorem 1 is tight, i.e. a linear dependence on the number of classes is inevitable if no further assumptions
are made.


4 Conclusion

In this paper we propose new state-of-the-art Rademacher complexity based upper bounds for the risk
of multi-class margin classifiers. The bound depends linearly on the number of classes. We also prove
that the bound cannot be further improved based on Rademacher complexities only. Still, it is
possible to provide better estimates for the excess risk of multi-class classification using other techniques
or supplementary assumptions.